Goal :
The objective of this project is to build a decision tree classifier that accurately predicts whether a client will subscribe to a bank deposit based on their demographic and behavioral data. We aim to identify the key factors influencing deposit subscriptions through our analysis, then use that information to build our predictive model and provide actionable insights to improve marketing strategies.
About Dataset:
The dataset was provided by Prodigy InfoTech from the UCI Machine Learning Repository. It contains information about clients and their interactions with the bank's marketing efforts, mostly demographic data (such as age, job, marital status, and education level) and behavioral data (such as past campaign success and contact methods). It is a well-known dataset for predicting the success of bank marketing campaigns.
Link to the dataset : Bank
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
df = pd.read_csv('C:/Users/obalabi adepoju/Downloads/bank.csv')
We'll look at a general overview of our data and a description of each column.
df.head(10)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 5 | 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 6 | 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 7 | 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 8 | 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 9 | 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        45211 non-null  int64
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64
 12  campaign   45211 non-null  int64
 13  pdays      45211 non-null  int64
 14  previous   45211 non-null  int64
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
print(f"This dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
This dataset contains 45211 rows and 17 columns
Here's a description of each column in the Dataset:
Feature Variable:
age: Age of the client. The client's age in years.
job: Type of job. The occupation of the client (e.g., "admin.", "blue-collar", "entrepreneur").
marital: Marital status. The client's marital status (e.g., "married", "single", "divorced").
education: Education level. The highest education level attained by the client (e.g., "primary", "secondary", "tertiary").
default: Credit in default. Whether the client has credit in default ("yes", "no").
balance: Average yearly balance in euros. The average balance of the client's account over the past year.
housing: Housing loan. Whether the client has a housing loan ("yes", "no").
loan: Personal loan. Whether the client has a personal loan ("yes", "no").
contact: Communication type. The type of communication used for the last contact ("telephone", "cellular").
day: Last contact day of the month. The day of the month when the last contact was made during the current marketing campaign.
month: Last contact month of the year. The month when the last contact was made (e.g., "jan", "feb", "mar").
duration: Last contact duration in seconds. The duration of the last contact in seconds.
campaign: Number of contacts performed during this campaign. The number of times the client was contacted during the current campaign.
pdays: Number of days since the client was last contacted from a previous campaign. The number of days since the client was last contacted in a previous campaign; '-1' indicates the client was not previously contacted.
previous: Number of contacts performed before this campaign. The number of contacts the client received before the current campaign.
poutcome: Outcome of the previous marketing campaign. The result of the previous marketing campaign ("unknown", "other", "failure", "success").
Target Variable:
y: Has the client subscribed to a deposit? ("yes", "no"). This is the outcome we want to predict.
We'll go through and clean all the important factors that contribute to or affect our target variable 'y', starting with our age column.
Note:
'unknown' is the value given to null values in our data, so wherever we have an 'unknown' value, it means the value is missing.
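Since 'unknown' plays the role of a null marker here, a small helper can tally it across every text column at once. A minimal sketch, shown on a tiny made-up frame rather than the bank data itself:

```python
import pandas as pd

def unknown_counts(frame: pd.DataFrame) -> pd.Series:
    """Count 'unknown' placeholders in each object (string) column."""
    obj_cols = frame.select_dtypes(include='object')
    return obj_cols.apply(lambda col: (col == 'unknown').sum())

# Toy frame for illustration (not the bank data):
demo = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services'],
    'education': ['primary', 'secondary', 'unknown'],
    'age': [30, 41, 52],
})
print(unknown_counts(demo))
```

On the real df, `unknown_counts(df)` would show at a glance which columns need the cleaning steps below.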
fig = px.violin(df,x='age',title='Age Distribution',color_discrete_sequence = ['dodgerblue'])
fig.show()
#checking for null values
df[['job']][df.job == 'unknown'].count()
job    288
dtype: int64
# The null values aren't much at all so we'll get rid of them
df = df[df.job != 'unknown']
df['job'].value_counts()
job
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
Name: count, dtype: int64
Blue-collar jobs represent the largest group, indicating a strong presence of manual labor and skilled trades among the customers in our data. Management comes in a close second, showing a significant portion of the population in leadership and decision-making roles. In the middle sit the technician through services roles, and the smallest groups consist of retirees, the self-employed, the unemployed, housemaids, and students.
# Next we want to look at our marital column
fig = px.pie(df,'marital',title = 'Marriage Distribution', color='marital',hole=0.5)
fig.show()
It's unsurprising that the majority of customers are married, with single individuals making up a smaller portion, and divorcees representing the smallest group.
# we check for null values
df[df.education == 'unknown']
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 13 | 58 | technician | married | unknown | no | 71 | yes | no | unknown | 5 | may | 71 | 1 | -1 | 0 | unknown | no |
| 16 | 45 | admin. | single | unknown | no | 13 | yes | no | unknown | 5 | may | 98 | 1 | -1 | 0 | unknown | no |
| 42 | 60 | blue-collar | married | unknown | no | 104 | yes | no | unknown | 5 | may | 22 | 1 | -1 | 0 | unknown | no |
| 44 | 58 | retired | married | unknown | no | 96 | yes | no | unknown | 5 | may | 616 | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45098 | 44 | technician | single | unknown | no | 11115 | no | no | cellular | 25 | oct | 189 | 1 | 185 | 4 | success | no |
| 45109 | 78 | management | married | unknown | no | 1780 | yes | no | cellular | 25 | oct | 211 | 2 | 185 | 7 | success | yes |
| 45129 | 46 | technician | married | unknown | no | 3308 | no | no | cellular | 27 | oct | 171 | 1 | 91 | 2 | success | yes |
| 45150 | 65 | management | married | unknown | no | 2352 | no | no | cellular | 8 | nov | 354 | 3 | 188 | 13 | success | no |
| 45158 | 34 | student | single | unknown | no | 2321 | no | no | cellular | 9 | nov | 600 | 2 | 99 | 5 | failure | no |
1730 rows × 17 columns
# Since the number of unknown values is small relative to our data, we'll remove it
df = df[df.education != 'unknown']
fig = px.histogram(df,'education',title = 'Education Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()
# Next we'll be looking at our default credit column to understand its distribution
df['default'].value_counts()
default
no     42411
yes      782
Name: count, dtype: int64
Impressively, only about 1.8% of clients have credit in default, indicating strong financial responsibility among the bank's customers.
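The 1.8% figure can be reproduced directly from the value counts above:

```python
import pandas as pd

# The counts printed by df['default'].value_counts() above:
counts = pd.Series({'no': 42411, 'yes': 782})

# Share of clients with credit in default:
default_rate = counts['yes'] / counts.sum() * 100
print(f"{default_rate:.1f}% of clients have credit in default")  # 1.8%
```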
# next we want to check the distribution of the average balance of customers for the year
fig = px.histogram(df,'balance',title = 'Average Balance Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()
Our histogram reveals that the most prominent balance range among customers is between 0 and 99, with the number of individuals decreasing as balances increase. Notably, the count of people with higher balances drops significantly from around 4,000 onward, becoming sparse beyond that point. Interestingly, a small segment of the population, accounting for 8.3% of our data, holds negative balances, with some reaching as low as -7,000. This highlights a minority of customers who are in debt.
Let's see what this would look like after filtering our outliers.
Q1 = df['balance'].quantile(0.25)
Q3 = df.balance.quantile(0.75)
IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR
lower_fence = Q1 - 1.5 * IQR
d = df[(df.balance > lower_fence) & (df.balance < upper_fence)]
fig = px.histogram(d,'balance',title = 'Average Balance Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()
#Let's check for null values
df[['contact']][df.contact == 'unknown'].count()
contact    12286
dtype: int64
Woah! Those are a whole lot of unknowns, and due to their sheer size we can't simply remove them, so we'll replace them with the most frequently occurring category.
df['contact'] = df['contact'].replace('unknown', df['contact'].mode()[0])
# Let's check out its distribution
fig = px.pie(df,'contact',title = 'Contact Distribution', color='contact',hole=0.5,
color_discrete_map = {'cellular':'mediumpurple','telephone':'lavender'})
fig.show()
#Let's check for null values
df[['month']][df.month == 'unknown'].count()
month    0
dtype: int64
fig = px.histogram(df,'month',title = 'Month Distribution', color_discrete_sequence=['#EF553B'])
fig.show()
We see that most of our records are in May, with July and August close behind, followed by June. December has the fewest of all, perhaps owing to the holiday season.
# Now we want to focus on the duration of the contact call
fig = px.histogram(df,'duration',title = 'Duration Distribution', color_discrete_sequence=['mediumpurple'])
fig.show()
Most calls last between 80 and 130 seconds, and the count drops off drastically as calls get longer.
Let's see what this would look like without extremities.
Q1 = df['duration'].quantile(0.25)
Q3 = df.duration.quantile(0.75)
IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR
d = df[df.duration < upper_fence]
fig = px.histogram(d,'duration',title = 'Duration Distribution', color_discrete_sequence=['mediumpurple'])
fig.show()
fig = px.histogram(df,'campaign',title = 'Campaign Distribution', color_discrete_sequence=['royalblue'])
fig.show()
We see most customers were contacted either once or twice during the course of the entire campaign, and the count quickly thins out as the number of contacts increases. We also see several odd values showing that quite a few people were contacted more than 20 times.
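Those odd tail values are easy to quantify with a quick threshold check. A minimal sketch on a toy series (the numbers below are made up for illustration, not drawn from the bank data):

```python
import pandas as pd

def heavy_contact_share(campaign: pd.Series, threshold: int = 20) -> float:
    """Fraction of clients contacted more than `threshold` times."""
    return (campaign > threshold).mean()

# Toy stand-in for df['campaign']:
demo = pd.Series([1, 2, 1, 3, 25, 1, 2, 41, 1, 2])
print(f"{heavy_contact_share(demo):.0%}")  # 20%
```

On the real data, `heavy_contact_share(df['campaign'])` would give the actual share of over-contacted clients.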
For the next phase of this project, we'll take a different approach: dividing our data into those with target "yes" and those with "no", and focusing our magnifying glasses on the differences between these two populations. This will give us insight into the main factors surrounding the outcome and guide our feature selection process. We want to understand the distinguishing characteristics of those who made a deposit compared to those who didn't.
dfyes = df[df['y'] == 'yes']
dfno = df[df['y'] == 'no']
print((len(dfno)/len(df)) * 100)
88.3754312041303
We see that our "no" population dominates, making up about 88% of the overall data.
def viz(title,column,c1,c2):
# Create subplots: 1 row, 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=("Population (Yes)", "Population (No)"))
if df[column].dtype == 'int64':
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR
lower_fence = Q1 - 1.5 * IQR
dfn = dfno[(dfno[column] > lower_fence) & (dfno[column] < upper_fence)]
dfy = dfyes[(dfyes[column] > lower_fence) & (dfyes[column] < upper_fence)]
else:
dfn = dfno
dfy = dfyes
# First histogram for 'Yes'
fig.add_trace(
go.Histogram(x=dfy[column], name='Yes', marker_color=c1,marker_line_color='black', marker_line_width=1),
row=1, col=1
)
# Second histogram for 'No'
fig.add_trace(
go.Histogram(x=dfn[column], name='No', marker_color=c2,marker_line_color='black',marker_line_width=1),
row=1, col=2
)
# Update layout
fig.update_layout(title_text= title + " Distribution by Deposit Outcome", showlegend=False)
# Show plot
fig.show()
def vizv(title,column,c1,c2):
# Create subplots: 1 row, 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=("Population (Yes)", "Population (No)"))
if df[column].dtype == 'int64':
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR
lower_fence = Q1 - 1.5 * IQR
dfn = dfno[(dfno[column] > lower_fence) & (dfno[column] < upper_fence)]
dfy = dfyes[(dfyes[column] > lower_fence) & (dfyes[column] < upper_fence)]
else:
dfn = dfno
dfy = dfyes
# First violin for 'Yes'
fig.add_trace(
go.Violin(x=dfy[column], name='Yes', marker_color=c1),
row=1, col=1
)
# Second violin for 'No'
fig.add_trace(
go.Violin(x=dfn[column], name='No', marker_color=c2),
row=1, col=2
)
# Update layout
fig.update_layout(title_text= title + " Distribution by Deposit Outcome", showlegend=False)
# Show plot
fig.show()
We'll be starting with our Education column.
vizv('Education','education','mediumseagreen','mediumpurple')
Through our visualizations, we're aiming to spot out differences between our "no" population and our "yes" population. With that, let's get into it.
Just in case you question the importance of this feature to our target, I'd like to highlight that preference for a specific method of communication influences how receptive someone is to what you're trying to pass across.
For example: some people prefer a call or a voice note for lengthy information, while others, like me, prefer a text or written message, however long, as long as it's concise. This is just to say people have different preferences, and that may influence their perception.
viz('Contact','contact','royalblue','mediumpurple')
viz('Monthly','month','mediumseagreen','royalblue')
viz('Campaign','campaign','#EF553B','mediumpurple')
viz('Occupation','job','royalblue','#EF553B')
viz('Average Balance','balance','deepskyblue','mediumpurple')
vizv('Mortgage','housing','blue','mediumseagreen')
viz("Loan",'loan','#EF553B','#EF553B')
viz("Daily",'day','blue','mediumseagreen')
viz('Duration','duration','royalblue','royalblue')
Education Level Discrepancy: There is a notable difference in education levels between the 'yes' and 'no' populations; tertiary education makes up a visibly larger share of 'yes' responders than of 'no' responders, indicating education level is a useful feature for modeling.
Preferred Contact Method: Cellular contact is the preferred method across both segments, showing no significant difference between 'yes' and 'no' populations.
Monthly Contact Patterns: May has the highest number of contacts, with August and April showing notable differences between 'yes' and 'no' populations. This indicates seasonality in contact effectiveness.
Occupation Influence: Management and technician occupations are more prevalent among 'yes' responders, while blue-collar workers, despite their large numbers, are less likely to say 'yes' compared to their general population.
Mortgage and Loan Status: A significant portion of 'yes' responders do not have a mortgage, contrasting with the 'no' responders. This suggests mortgage status is an important feature for predicting positive outcomes.
Effectiveness of Call Duration: Shorter call durations are less effective in converting individuals to 'yes,' suggesting longer interactions may be more successful in achieving positive responses.
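Insights like the mortgage one above can be sanity-checked numerically by computing the 'yes' rate within each category. A minimal sketch on a toy frame (values are illustrative only; on the real data you would call `yes_rate_by(df, 'housing')`):

```python
import pandas as pd

def yes_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of 'yes' outcomes within each level of `column`."""
    return frame.assign(flag=frame['y'].eq('yes')).groupby(column)['flag'].mean()

# Toy frame mirroring the housing/y columns:
demo = pd.DataFrame({
    'housing': ['yes', 'yes', 'no', 'no', 'no', 'yes'],
    'y':       ['no',  'no',  'yes', 'no', 'yes', 'no'],
})
print(yes_rate_by(demo, 'housing'))
```

Comparing these per-group rates is often a more direct check than eyeballing side-by-side histograms, since it normalizes away each group's size.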
We'll start off this next step by encoding our variables.
# One-hot encode 'job', 'month', and 'contact'
df_encoded = pd.get_dummies(df, columns=['job','contact','month'], drop_first=True)
# Apply label encoding to 'education'
label_encoder = LabelEncoder()
df_encoded['education'] = label_encoder.fit_transform(df['education'])
# Transforming our yes/no columns to 1's and 0's
df_encoded['housing'] = df_encoded['housing'].map({'yes': 1, 'no': 0})
df_encoded['loan'] = df_encoded['loan'].map({'yes': 1, 'no': 0})
df_encoded['y'] = df_encoded['y'].map({'yes': 1, 'no': 0})
df_encoded['default'] = df_encoded['default'].map({'yes': 1, 'no': 0})
# Dropping columns we won't be making use of in our model
df_encoded.drop(columns=['marital','poutcome','previous','pdays','age','campaign'],inplace=True)
# Shuffle the DataFrame
data = df_encoded.sample(frac=1, random_state=42).reset_index(drop=True)
# An example of what our data looks like now
data
| | education | default | balance | housing | loan | day | duration | y | job_blue-collar | job_entrepreneur | ... | month_dec | month_feb | month_jan | month_jul | month_jun | month_mar | month_may | month_nov | month_oct | month_sep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3417 | 0 | 0 | 12 | 134 | 1 | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1 | 2 | 0 | 5506 | 0 | 1 | 6 | 141 | 0 | False | True | ... | False | False | False | False | True | False | False | False | False | False |
| 2 | 0 | 0 | 556 | 1 | 0 | 14 | 227 | 0 | True | False | ... | False | False | False | False | False | False | True | False | False | False |
| 3 | 0 | 0 | 1406 | 1 | 0 | 14 | 252 | 0 | True | False | ... | False | False | False | False | False | False | True | False | False | False |
| 4 | 1 | 0 | 397 | 1 | 0 | 23 | 252 | 0 | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43188 | 1 | 0 | 181 | 1 | 0 | 28 | 230 | 0 | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| 43189 | 1 | 0 | 1483 | 0 | 0 | 20 | 32 | 0 | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 43190 | 1 | 0 | 2087 | 0 | 0 | 1 | 111 | 0 | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 43191 | 1 | 0 | 528 | 0 | 0 | 7 | 274 | 0 | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| 43192 | 1 | 0 | 204 | 1 | 0 | 24 | 38 | 0 | False | False | ... | False | False | False | True | False | False | False | False | False | False |
43193 rows × 30 columns
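Before moving on, it's worth noting what `LabelEncoder` did to the education column: it assigns integer codes in alphabetical order of the class labels, which for this data happens to coincide with the natural primary < secondary < tertiary ordering. A quick sketch:

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the three education levels present after cleaning:
le = LabelEncoder()
le.fit(['primary', 'secondary', 'tertiary'])

# Recover the label-to-code mapping:
mapping = {cls: int(code) for cls, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)  # {'primary': 0, 'secondary': 1, 'tertiary': 2}
```

If the alphabetical order had not matched the natural order, an explicit ordinal map (e.g. via `Series.map`) would be the safer choice.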
#Define features (X) and target (y)
X = data.drop('y', axis=1)
y = data['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
As we previously observed, the "No" responses take up 88% of our entire data, and while data modeling is still pretty cool, it's not magic. The dataset is imbalanced, and the population of those who said "Yes" stands firmly in the minority class. Luckily, we have a way to take care of this: we'll balance our training data by oversampling the minority class with synthetic samples, ensuring our model has enough Yes's to discover a pattern.
Rest assured, we're not just going to duplicate data points. SMOTE (Synthetic Minority Over-sampling Technique) generates new samples by taking each instance in our minority class together with its nearest neighbors, which are also in our data, and interpolating new points between them, ensuring diversity and that the generated data stays within the feature space.
from imblearn.over_sampling import SMOTE
# Initialize SMOTE
smote = SMOTE(random_state=42)
# Fit and transform the training data
X_train_, y_train_ = smote.fit_resample(X_train, y_train)
# Initialize our model
clf = DecisionTreeClassifier( random_state=42)
# Train the model
clf.fit(X_train_, y_train_)
DecisionTreeClassifier(random_state=42)
# Make predictions using our model
y_pred = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Generate classification report
report = classification_report(y_test, y_pred)
# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)
plt.figure(figsize=(7, 5))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 0.86
Classification Report:
precision recall f1-score support
0 0.93 0.91 0.92 15281
1 0.40 0.47 0.43 1997
accuracy 0.86 17278
macro avg 0.66 0.69 0.67 17278
weighted avg 0.87 0.86 0.86 17278